One important goal of statistical inference is to learn about the data generating process. In the simplest case, this means finding out whether a certain independent variable influences our outcome of interest. If we run a randomized experiment, we can simply compare the outcomes conditional on different values of the independent variable to find out if the independent variable is part of the data generating process.

Keeping with our experiment example, we can also set up two alternative statistical models (e.g. regression models) of the data generating process, where one model assumes an effect of the independent variable and the other does not. Then we could compare the two models and declare the model that fits the data better the winner. The problem with this approach is that if the addition of one independent variable is the only difference between the two models, the model with more variables will always fit the data better, even if it is not the correct model.
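To make this concrete, here is a minimal sketch (our own simulated data and two lm fits, not the models used later in this section) showing that the larger of two nested models never fits the sample worse:

set.seed(1)
n  = 50
x1 = rnorm(n); x2 = rnorm(n)   # x2 is pure noise, unrelated to y
y  = 1 + 2*x1 + rnorm(n)       # true DGP uses only x1
m1 = lm(y ~ x1)                # model consistent with the DGP
m2 = lm(y ~ x1 + x2)           # adds an irrelevant predictor
c(R2.simple = summary(m1)$r.squared, R2.larger = summary(m2)$r.squared)
# the larger model's R^2 is never smaller, even though x2 is pure noise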

Hence, we need more sophisticated approaches to determine which of two models better captures the data generating process.

Note that we are not just interested in finding a model that describes the data we are seeing well; if we were, we could simply choose the best-fitting model. Instead, we want to use the data at hand to learn about the process that generated the data we see. After all, only knowledge about this process generalizes to other situations.

Overfitting and underfitting

The fact that more complex models always fit the data better is related to the term overfitting. Overfitting means that the fitted model does not only describe the data generating process, but also treats parts of the data that are just noise as if they were part of the data generating process.

For instance, assume that we are looking at the relationship between income and well-being, which could have this (simulated) relationship:

N = 1000
set.seed(123456)
x = sort(rnorm(N,mean = 0))
bf.s = splines::bs(x,knots =3)
bf.b = matrix(c(1,1,1,0),ncol = 1)
mu = bf.s %*% bf.b
y = rnorm(N, mu , sd = .1)
par(mfrow = c(1,1))

plot_xy = function(x,y,mu, idx = NULL) {
  if (is.null(idx)) idx = 1:length(x)
  plot(x[idx], y[idx],
       ylab = "",
       xlab = "",
       xlim = quantile(x, c(.01,.99)),
       ylim = quantile(y, c(.01,.99)))
  lines(x, mu, col = "red")
}

plot_xy(x,y,mu)


However, we only have a sample of N = 15 participants to learn about this relationship. The following figure shows different random samples in rows, and predictions of regression models of increasing complexity, from a simple linear model \(\small \mu = \alpha + \beta x\) to a polynomial of the 4th order \(\small \mu = \alpha + \beta_1 x + \beta_2 x^2 + \beta_3 x^3 + \beta_4 x^4\), in columns. The red lines show the true relationship and the blue lines show the learned relationships.

plot_fits = function(x,y,mu,sample_size) {
  par(mfrow = c(4,4), mar=c(2,2,.5,.5), mgp=c(1.25,.25,0), tck=-.01)
  for (k in 1:4) {
    idx = sample(N,sample_size)
    
    dt = data.frame(x = x[idx], y = y[idx])
    
    q.model = alist(
      y ~ dnorm(mu,exp(log_sigma)),
      mu <- a + b*x,
      a ~ dnorm(0.5,1),
      b ~ dnorm(0,10),
      log_sigma ~ dnorm(0,1)
    )
    
    plot_xy(x,y,mu,idx)
    plot_quap_preds(q.model,dt,"x", plot = F)
    
    plot_xy(x,y,mu,idx)
    q.model[[2]]  = alist(mu <- a + b[1]*x + b[2]*x^2)[[1]]
    plot_quap_preds(q.model,dt,"x", start.list = list(b=rep(0,5)), plot = F)
    
    plot_xy(x,y,mu,idx)
    q.model[[2]]  = alist(mu <- a + b[1]*x + b[2]*x^2 + b[3]*x^3)[[1]]
    plot_quap_preds(q.model,dt,"x", start.list = list(b=rep(0,5)), plot = F)
    
    plot_xy(x,y,mu,idx)
    q.model[[2]]  = alist(mu <- a + b[1]*x + b[2]*x^2 + b[3]*x^3 + b[4]*x^4)[[1]]
    plot_quap_preds(q.model,dt,"x", start.list = list(b=rep(0,5)), plot = F)
    
  }
}
plot_fits(x,y,mu,sample_size = 15)

This figure shows that while making the model increasingly complex allows us to fit the data ever better (see the \(\small R^2\) values), it does not necessarily allow us to capture the true data generating process better. By looking at model predictions at extreme values of x, we can also see that the complex models overfit.

What happens if we increase the sample size to 75?

plot_fits(x,y,mu,sample_size = 75)

The variation in the inferred associations between different samples is now smaller, but we still see overfitting, and also underfitting, i.e. the inability of the simplest model to capture the non-linear relationship. This would be better visible with posterior predictive plots, which would show a systematic difference between data and model predictions, i.e. overestimation of \(\small y\) at large and small \(\small x\).

As an aside, we can also see that a polynomial model has difficulty capturing the true non-linear and monotonic relationship, which was generated with a spline model.

The goal of model comparison is then to identify the model that best describes the data generating process, i.e. the model that suffers the least from over- or underfitting.

Evaluating accuracy

To identify which model is good, we need some measure of model accuracy, or consistency of model and data.

Imagine you are working with a drug rehabilitation center and are trying to evaluate different statistical models that predict whether a patient will have a relapse within 4 weeks after leaving the center. Your data look as follows:

set.seed(1234)
treatment.weeks = round(rnorm(15,mean = 3, sd = 1),1); 
dt = data.frame(
  treatment.weeks, 
  p.relapse = round(inv_logit(-treatment.weeks+1.25),2), 
  relapse = (round(inv_logit(-treatment.weeks+1.25),2) > runif(15))*1 )
kable(dt[, c("treatment.weeks", "relapse")]) %>% 
  kable_styling(full_width = F)
treatment.weeks relapse
1.8 0
3.3 0
4.1 0
0.7 1
3.4 0
3.5 0
2.4 1
2.5 0
2.4 0
2.1 0
2.5 0
2.0 0
2.2 0
3.1 0
4.0 0

We can build two prediction models: one takes the number of treatment weeks into account, and one just assumes that treatment is good and nobody ever has a relapse. Here is the data with predictions from these models:

dt$pred.no_relapse = 0
dt$pred.weeks = 
  glm(dt$relapse~dt$treatment.weeks, family = binomial()) %>% 
  predict(type = "response") %>% 
  round(3)

kable(dt[, c("treatment.weeks", "relapse", "pred.no_relapse", "pred.weeks")]) %>% 
  kable_styling(full_width = F)
treatment.weeks relapse pred.no_relapse pred.weeks
1.8 0 0 0.286
3.3 0 0 0.012
4.1 0 0 0.002
0.7 1 0 0.843
3.4 0 0 0.009
3.5 0 0 0.007
2.4 1 0 0.089
2.5 0 0 0.071
2.4 0 0 0.089
2.1 0 0 0.165
2.5 0 0 0.071
2.0 0 0 0.200
2.2 0 0 0.135
3.1 0 0 0.018
4.0 0 0 0.002

A simple way to calculate accuracy would be to compute the average probability of making a correct prediction. Here are these average probabilities for the two models:

av_acc.no_relapse = 
  (sum(1-dt$pred.no_relapse[dt$relapse == 0]) +   # prob. non-relapse
     sum(dt$pred.no_relapse[dt$relapse == 1])) /  # prob. relapse
  15                                              # divide by N 
av_acc.weeks = 
  (sum(1-dt$pred.weeks[dt$relapse == 0]) + 
   sum(dt$pred.weeks[dt$relapse == 1]))/15

av_acc.no_relapse
## [1] 0.8666667
av_acc.weeks
## [1] 0.8576667

We can see that, by this metric, the model that ignores the available information, and with it the true data generating process, performs better. That an accuracy criterion can favor models that are inconsistent with the data generating process is an undesirable property.

Recall that in Bayesian updating we used the joint probability of the data given the model, which is something different from the average probability we just calculated. One intuition for why the joint probability is useful is that, because it involves multiplication of all probabilities, it does not allow us to ignore some of the data as long as we predict most of the data well. A related viewpoint is that averaging allows us to favor models that make impossible predictions, such as a relapse probability of zero for everyone.

Hence, here is the joint probability for our two models:

joint_prob.no_relapse = 
   prod(1-dt$pred.no_relapse[dt$relapse == 0]) *  
     prod(dt$pred.no_relapse[dt$relapse == 1])
joint_prob.weeks = 
  prod(1-dt$pred.weeks[dt$relapse == 0]) * 
   prod(dt$pred.weeks[dt$relapse == 1])
joint_prob.no_relapse
## [1] 0
joint_prob.weeks
## [1] 0.02314252

Now we see that the model that is consistent with the data generating process has a better “score”. Compare the code for the average accuracy and the joint probability to see that the difference is that the former approach sums over probabilities whereas the latter multiplies.

Instead of multiplying probabilities, we can also sum log-probabilities:

joint_log_prob.no_relapse = 
   sum(log(1-dt$pred.no_relapse[dt$relapse == 0])) +  
     sum(log(dt$pred.no_relapse[dt$relapse == 1]))
joint_log_prob.weeks = 
  sum(log(1-dt$pred.weeks[dt$relapse == 0])) + 
   sum(log(dt$pred.weeks[dt$relapse == 1]))
joint_log_prob.no_relapse
## [1] -Inf
joint_log_prob.weeks
## [1] -3.766084

Multiplying probabilities and adding log probabilities lead to the same result; the values differ only due to numerical problems that arise when multiplying small probabilities:

joint_prob.weeks - exp(joint_log_prob.weeks)
## [1] -6.938894e-18

Information, uncertainty and entropy

The concepts of information and uncertainty are tightly related: The more information we have, the less uncertain we are.

Measures of information or uncertainty should have the following properties: they should be continuous, they should increase with the number of possible events, and they should be additive. We go through these properties below.

Entropy

We’ll use entropy to illustrate these properties. The entropy of a probability distribution encoded in the vector \(\small p\) with length \(\small n\) is defined as:

\[ H(p) = -E\,log(p_i) = -\sum_{i=1}^{n} p_i \, log(p_i) \]

Here, \(\small E\) is the expectation or weighted mean, where the weights are given by the probabilities of the events.

In R code we can write:

H = function(p) -sum(p*log(p))
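Because entropy is just minus the expected log probability, we can equivalently write it with R's weighted.mean (a quick check of our own, using an arbitrary two-event distribution):

p = c(.2, .8)
cbind(H(p), -weighted.mean(log(p), w = p))  # both give the same value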

Note that if one \(\small p_i = 1\) the sum of all other \(\small p_i\)s must be zero and entropy / uncertainty becomes zero.
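Note, however, that the simple H() above returns NaN when some \(\small p_i = 0\), because R evaluates 0 * log(0) as NaN. Here is a minimal sketch of a guarded version (our own helper, using the convention that 0 * log(0) = 0) that makes the limiting case explicit:

H0 = function(p) -sum(ifelse(p > 0, p*log(p), 0))
H0(c(1, 0))    # 0: a certain event carries no uncertainty
H0(c(.5, .5))  # log(2) ~ 0.69: maximal uncertainty for two events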

Continuity

Now let's look at the simple case of only two possible events. The bottom panel of the following figure shows the probability of the first event \(\small p_1\) on the x-axis and the entropy on the y-axis:

p1 = seq(.01,.99, length.out = 99)
entropy = do.call(
  c,
  lapply(p1, function(p1) H(c(p1,1-p1)))
)
layout(matrix(c(1,2,2), ncol = 1))
par(mar=c(0,3,1,1), mgp=c(1.25,.3,0), tck=-.01)
plot(0,type = "n", xlim = c(0,1), ylim = c(0,1), 
     xlab = "", xaxt = "n",
     ylab = expression(p[1]+p[2]))
polygon(x = c(0,1,1,0), y = c(0,0,1,0), col = "blue")
polygon(x = c(0,1,0,0), y = c(0,1,1,0), col = "red")
text(.75,.25, expression(p[1]), cex = 2, col = "white")
text(.25,.75, expression(p[2]), cex = 2, col = "white")
par(mar=c(3,3,1,1), mgp=c(1.25,.3,0), tck=-.01)
plot(p1,entropy, type = 'l', xlab = expression(p[1]))

The figure shows that when we have two events and assign them equal probability, uncertainty/entropy is large, whereas uncertainty/entropy becomes smaller when we assign one of the two options a higher probability.

All changes in entropy are continuous in changes of \(\small p_1\), i.e. the entropy curve has no jumps or steps.

Increasing in number of events

What happens with entropy if we increase the number of events and calculate entropy for uniform distributions over events?

n = 2:8
entropies = 
  do.call(
    c,
    lapply(n, function(n) H(rep(1,n)/n))
  )
names(entropies) = n
barplot(entropies, ylab = "entropy", xlab = "number possible events with equal probability")

As the number of possible events increases, entropy increases.

A related property is that if we start with probabilities for two events, e.g.

P2 = c(A1 = .5, A2 = .5)

and further split one of the two events

P2s = c(A1 = .5,  A2.1 = .25, A2.2 = .25)

then the entropy of the second distribution / model should be larger:

cbind(H(P2), H(P2s))
##           [,1]     [,2]
## [1,] 0.6931472 1.039721

Additivity

Additivity refers to the fact that the way we represent the data should not influence the information measure.

Let's assume we have the following joint distribution of event types A and B:

P = matrix(c(.125,.375,.125,.375), ncol = 2)
colnames(P) = c("A1","A2")
rownames(P) = c("B1","B2")

P %>% 
  kable() %>% 
  kable_styling(full_width = F)
A1 A2
B1 0.125 0.125
B2 0.375 0.375

Given these cells, we can calculate the margins:

P = addmargins(P)
P %>% 
  kable() %>% 
  kable_styling(full_width = F)
A1 A2 Sum
B1 0.125 0.125 0.25
B2 0.375 0.375 0.75
Sum 0.500 0.500 1.00

And given the margins, we can also calculate the cell entries. For example:

\[ \small \begin{aligned} P(A=1, B=1) & =P(A=1)P(B=1)&\\ & =.5 \cdot .25&\\ & =.125 \end{aligned} \]
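In R, this reconstruction of all cells from the margins (which works here because A and B are independent) is a one-liner; the margin values below are taken from the table above:

outer(c(B1 = .25, B2 = .75), c(A1 = .5, A2 = .5))  # rows: B, columns: A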

Because we can calculate the table cells from the margins (here A and B are independent) and vice versa, the information / entropy in the cells representation and in the margins representation should be the same, i.e.

\[ \small \begin{aligned} P_{cells} = \{&P(A_1,B_1), \\ &P(A_1,B_2), \\ &P(A_2,B_1), \\ &P(A_2,B_2)\}\\ P_A = \{&P(A_1),\\ &P(A_2)\} \\ P_B = \{&P(B_1),\\ &P(B_2)\} \\ H(P_{cells}) = &H(P_A) + H(P_B) \end{aligned} \]

Let's try this out:

H_cells = H(c(.125, .125, .375, .375))
H_margins = H(c(.5,.5)) + H(c(.25,.75))
cbind(H_cells, H_margins)
##       H_cells H_margins
## [1,] 1.255482  1.255482

To summarize, you can keep in mind that \[ entropy = uncertainty = -information \]

Using entropy to measure accuracy

To understand why entropy is useful for measuring the accuracy of a model, one can reframe the problem and ask “Does the model contain the same information as the true data generating process?”

We can talk about information in the (model of the) data generating process and in the (inferred) model, because we can calculate the probability of the data both for the true data generating process and for the model (this is what we create with the sim function from the rethinking package).

Let's use our initial example again to make clear what we mean when we measure accuracy:

layout(matrix(c(1,3,2,3), ncol = 2))
plot(x,y)
lines(x, mu, col = "red")
title("DGP and population data")
set.seed(1)
idx = sample(N,25)

dt = data.frame(x = x[idx], y = y[idx])

q.model = alist(
  y ~ dnorm(mu,exp(log_sigma)),
  mu <- a + b[1]*x + b[2]*x^2,
  a ~ dnorm(0.5,1),
  b ~ dnorm(0,10),
  log_sigma ~ dnorm(0,1)
)

plot_xy(x,y,mu, idx)
plot_quap_preds(q.model,dt,"x", plot = F, start.list = list(b = rep(0,3)))
mtext(1,text = "x", line = 2.5)
mtext(2,text = "y", line = 2.5)
title("DGP, sample & inferred DGP / model")

preds = plot_quap_preds(q.model,dt,"x", plot = F,
                       start.list = list(b = rep(0,3)),
                       return.yhat = TRUE)
preds$y = predict(bf.s,newx = preds$x) %*% bf.b 
plot(preds$x, preds$yhat,'l', col = "blue", ylab = "", xlab = "")
mtext(1,text = "x", line = 2.5)
mtext(2,text = "y", line = 2.5)
arrows(x0 = preds$x, y0 = preds$y, 
       y1 = preds$yhat,length = .05, code = 3, 
       col = adjustcolor("purple",alpha = .5))
lines(preds$x, preds$y,'l', col = "red")
lines(preds$x, preds$yhat,'l', col = "blue")
title("Difference between true DGP and candidate model")
legend("topleft", lty = 1, col = c("red","blue"), legend = c("true DGP", "model"), bty = "n")

We can calculate the probability of the data according to the true DGP and according to the inferred model. In our example, the data are distributed under the true DGP as \(\small y_i \sim Normal(1\cdot bs_1(x_i) + 1\cdot bs_2(x_i) + 1\cdot bs_3(x_i) + 0\cdot bs_4(x_i), 0.1)\)1

or in R code:

p = dnorm(y, 1*bs1 + 1*bs2 + 1*bs3 + 0*bs4, 0.1)  # bs1 ... bs4 are the spline basis columns bf.s[,1] ... bf.s[,4]

As before, we can use p to calculate how much information about the data we have in the DGP. Entropy tells us how much information about the observed data we have in our model.

According to the inferred model (using posterior means), the data are distributed as \(\small y_i \sim Normal(0.97 + 0.11 \cdot x_i - 0.05 \cdot x_i^2, 0.1)\)

or in R code:

p = dnorm(y, 0.97 + 0.11*x - 0.05*x^2, 0.1)

If we then want to learn about the accuracy of a candidate model, we can compute how much information about the data it has relative to the true DGP (which is also a model). This difference, or divergence, in information about the data is called the Kullback-Leibler divergence, or KL divergence, or \(\small D_{KL}\) for short.

With probabilities of observed events given the true DGP \(\small p\) and model predicted probabilities \(\small q\) the Kullback-Leibler divergence is defined as: \[ \small \begin{aligned} D_{KL}(p,q) & = \sum_i p_i \,log(p_i) - p_i \, log(q_i) \\ & = \sum_i p_i \, \left(log(p_i) - log(q_i)\right) \\ & = \sum_i p_i \, log \left( \frac{p_i}{q_i} \right) \end{aligned} \]

Here, \(\small -\sum_i p_i \,log(p_i)\) is the entropy and \(\small -\sum_i p_i \,log(q_i)\) is the cross entropy, so the KL divergence is the cross entropy minus the entropy.
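As a quick numerical check (our own sketch, reusing H() from above with two arbitrary two-event distributions), the divergence indeed equals the cross entropy minus the entropy:

p = c(.375, .625); q = c(.25, .75)   # arbitrary example distributions
cross_entropy = -sum(p*log(q))
kl = sum(p*(log(p) - log(q)))
cbind(kl, cross_entropy - H(p))      # both are ~ 0.038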

Note that if \(\small p = q\) then \(\small log(p_i) - log(q_i) = 0\) and we could say that the candidate model has no additional uncertainty about the data on top of the uncertainty that is already present given the DGP: \(\small D_{KL}(p,q) = 0 \; | \; p = q\). Therefore we can say that Kullback-Leibler divergence is a measure of additional uncertainty due to using a candidate model instead of the true model.

Here is an R function for the Kullback-Leibler divergence:

dKL = function(p,q) sum(p*(log(p)-log(q)))

which we can use to plot the divergence between the distribution p = c(.375,.625) and alternative distributions q:

p = c(.375,.625)
q1 = seq(.01,.99,.01)
DKL_pq = do.call(
  c,
  lapply(q1, function(q1) dKL(p,c(q1,1-q1)))
)
plot(q1,DKL_pq,
     type = 'l',
     ylab = expression(D[KL](p,q)),
     xaxt = "n", bty = "n")
lines(rep(p[1],2), c(0, max(DKL_pq)), lty = 2)
axis(1, pos=0, at=seq(0,1,.25))

Estimating deviance

So far, we have determined that

  • entropy is a good measure of information and
  • KL divergence is a good measure of distance between the true model and the model predictions.

However, we generally do not know the true model; we only have a sample of data that was generated from it. This is not really a problem, because our initial goal was model comparison. So what we really want to do is compare the deviances of two models, here \(\small q\) and \(\small r\):

\[\small D_{KL}(p,q) - D_{KL}(p,r)\] Here is how the deviances of the two models are defined:

\[ D_{KL}(p,q) = \sum_i p_i \,log(p_i) - p_i \, log(q_i) \\ D_{KL}(p,r) = \sum_i p_i \,log(p_i) - p_i \, log(r_i) \]

If we want to compute the relative accuracy of the models, we calculate

\[ relative \; accuracy = \sum_i \left( p_i \,log(p_i) - p_i \, log(q_i) \right) - \sum_i \left( p_i \,log(p_i) - p_i \, log(r_i) \right) \]

Luckily, we can simplify this because

  • both deviances contain the same \(p_i \,log(p_i)\) term, which cancels, and
  • both weight the log probability of the data under the model by the same \(p_i\).

Therefore we can drop these terms and are then in the delightful situation that we only need to know \(\small \sum_i log(q_i)\) and \(\small \sum_i log(r_i)\) if we want to compare candidate models:

\[ relative \; accuracy = \sum_i log(q_i) - \sum_i log(r_i) \]

where

\[ S(q) = \sum_i log(q_i) \]

is the total score for model \(\small q\).
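The joint log probabilities we computed earlier for the two relapse models are exactly such total scores, so their difference gives the relative accuracy of the two models (a small illustration using the objects defined above):

# S(weeks model) - S(no-relapse model); the -Inf score of the no-relapse
# model reflects that it assigned probability 0 to an observed relapse
joint_log_prob.weeks - joint_log_prob.no_relapse
## [1] Inf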

To see how we can calculate this quantity for a model, remember that we started with the definition of entropy:

\[ H(p) = -E\,log(p_i) = -\sum_{i=1}^{n} p_i \, log(p_i) \]

where \(\small p_i\) is the probability of observing some data given the model (and its parameters). Where have we heard this before?

\[ \overset{\color{violet}{\text{posterior probability}}}{P(parameter|data)} = \frac{\overset{\color{red}{\text{likelihood}}}{P(data|parameter)} \cdot \overset{\color{blue}{\text{prior probability}}}{P(parameter)}}{\overset{\color{orange}{\text{evidence}}}{P(data)}} \]

This is just the likelihood term we already use when we do Bayesian updating. One important thing when computing this quantity in a Bayesian context is that the \(\small parameter\) is not a single value, but a posterior distribution.

The rethinking package has a function lppd, which calculates for each data point the log of the mean probability of the observation given the model and posterior distribution of parameters, also called the log pointwise predictive density.

Coming back to this example:

q.model = alist(
  y ~ dnorm(mu,exp(log_sigma)),
  mu <- a + b[1]*x + b[2]*x^2,
  a ~ dnorm(0.5,1),
  b ~ dnorm(0,10),
  log_sigma ~ dnorm(0,1)
)
q.fit = quap(q.model, dt, start = list(b = rep(0,2)))

We are calculating

  • for each data point, its average probability under the model (averaging the likelihood over posterior samples),
  • the log of this average probability,
  • and, for a total score, the sum of these log probabilities over all data points (lppd returns the pointwise values):
lppd(q.fit)
##  [1]  1.2081469 -0.4331930  0.3442207  1.2504365  0.8566858  0.9679976
##  [7]  0.8970060  0.9719373  1.0595419  1.1380290 -0.4848278  1.2878312
## [13]  0.8041407  0.5635683  1.2340670  1.2612162 -0.2025854  1.3195827
## [19]  1.2680272  0.6709762  0.8384172  1.0665948  1.1372614  1.1560851
## [25]  1.3052097
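To see what lppd is doing, here is a minimal hand-rolled sketch (our own code, hard-coded for the quadratic model above, so it is not a general replacement for rethinking's lppd; it also averages raw likelihoods instead of using a numerically safer log-sum-exp, so values will match only approximately):

lppd_by_hand = function(fit, xs, ys, S = 1e4) {
  post = extract.samples(fit, n = S)                  # posterior samples
  sapply(seq_along(ys), function(i) {
    # mu for observation i under each posterior sample
    mu_s = post$a + post$b[,1]*xs[i] + post$b[,2]*xs[i]^2
    # average the likelihood over samples, then take the log
    log(mean(dnorm(ys[i], mu_s, exp(post$log_sigma))))
  })
}
round(lppd_by_hand(q.fit, dt$x, dt$y), 2)  # closely matches lppd(q.fit)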

Cross validation

So far, we have calculated the divergence for the data we observed. The problem is that this is thus still a divergence of the model fit from the observed data. However, what we want to compare is how well different models are consistent with the data generating process, or with new data we could observe from it.

One pragmatic way to go about this is to evaluate the model fit not on the data we used to fit the model parameters (training data), but on a new set of test data that we did not use to estimate the parameters.

The following figure shows, in addition to the 25 data points used to estimate the model parameters, 25 new test data points that were not used for estimation. We can then estimate the deviance of the model predictions on these test data.

To visualize that we are using the posterior distribution to calculate deviance, the figure also shows 50 lines (mu), one for each posterior sample. The figure does not show variation in the standard deviation.

par(mar=c(3,3,0.5,0.5), mgp=c(1.25,.5,0), tck=-.01)
plot_xy(x,y,mu, idx)
mtext(1,text = "x", line = 2)
mtext(2,text = "y", line = 2)
set.seed(123)
test.idx = sample(setdiff(1:length(x),idx),25)
mu.test = link(q.fit, data = data.frame(x = c(x[test.idx],seq(-2.5,2.5,length.out = 100))), n = 50)
matlines(seq(-2.5,2.5,length.out = 100),y = t(mu.test[,-(1:25)]), lty = 1, 
         col = adjustcolor("blue", alpha = .25))
os = seq(-.02,.02, length.out = 50)
for (k in 1:50)
  arrows(x[test.idx] + os[k], y0 = y[test.idx], y1 = mu.test[k,1:25], length = 0, col = adjustcolor("purple", alpha = .2))
points(x[test.idx],y[test.idx], col = "red", pch = 16)
legend("topleft",
       pch = c(1,26,0,0),
       lty = c(0,0,1,1),
       col = c("black",adjustcolor("purple", alpha = .2),"blue","red"),
       legend = c("train data","test data","DGP mu","posterior mus"),)
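To make the idea concrete, here is a sketch (our own code, reusing q.fit, test.idx, and the lppd_by_hand helper from above) that computes the total log score on the training sample and on the held-out test points; the test score is typically worse:

sum(lppd(q.fit))                                    # total score on training data
sum(lppd_by_hand(q.fit, x[test.idx], y[test.idx]))  # total score on test data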

We can use the following figure from the book to explain why deviance on a hold-out test data set (cross-validation) is a better approach to model comparison than deviance on the training data:

  • deviance in the training data generally decreases with increasing model complexity, but we don’t want to choose the most complex model
  • deviance on a hold-out test data set is better able to identify the correct model (with 2 parameters)
  • using deviance on the training data alone remains a problem even for larger sample sizes.


  1. \(bs\) are spline basis functions